Abstract

Background: We study the sparsification of dynamic programming folding algorithms of RNA structures.
Sparsification applies to the mfe-folding of RNA structures and can lead to a significant reduction of time
complexity.

Results: We analyze the sparsification of a particular decomposition rule, $\Lambda^*$, that splits an interval for RNA
secondary and pseudoknot structures of fixed topological genus. Essential for quantifying the sparsification is the
size of its so-called candidate set. We present a combinatorial framework which allows, by means of probabilities of
irreducible substructures, to obtain the expected size of the set of $\Lambda^*$-candidates. We compute these expectations
for arc-based energy models via energy-filtered generating functions (GF) for RNA secondary structures as well
as RNA pseudoknot structures. For RNA secondary structures we also consider a simplified loop-energy model.
This combinatorial analysis is then compared to the expected number of $\Lambda^*$-candidates obtained from folding
mfe-structures. In case of the mfe-folding of RNA secondary structures with a simplified loop energy model our
results imply that sparsification provides a reduction of time complexity by a constant factor of 91% (theory)
versus a 96% reduction (experiment). For the "full" loop-energy model there is a reduction of 98% (experiment).

Conclusions: Our results show that the polymer-zeta property, describing the probability of an irreducible
structure over an interval of length $m$, does not hold for RNA structures. As a result sparsification of the
$\Lambda^*$-decomposition rule does not lead to a linear reduction of the set of candidates. We show that under general
assumptions the expected number of $\Lambda^*$-candidates is $\Theta(n^2)$, the constant reduction being in the range of 95%.
The sparsification of the $\Lambda^*$-decomposition rule for RNA pseudoknotted structures of genus 1 leads to an expected
number of candidates of $\Theta(n^2)$. The effect of sparsification is sensitive to the employed energy model.
Background

An RNA sequence is a linear, oriented sequence of the nucleotides (bases) A,U,G,C. These sequences "fold"
by establishing bonds between pairs of nucleotides. Bonds cannot form arbitrarily: a nucleotide can at most
establish one Watson-Crick base pair A-U or G-C or a wobble base pair U-G, and the global conformation
of an RNA molecule is determined by topological constraints encoded at the level of secondary structure,
i.e., by the mutual arrangements of the base pairs [1].

Secondary structures can be interpreted as (partial) matchings in a graph of permissible base pairs [5].
They can be represented as diagrams, i.e. graphs over the vertices $1, \dots, n$, drawn on a horizontal line with
bonds (arcs) in the upper halfplane. In this representation one refers to a secondary structure without
crossing arcs as a simple secondary structure and as a pseudoknot structure otherwise, see Figure 1.




Figure 1: RNA structures as planar graphs and diagrams. (A) an RNA secondary structure and (B) an RNA 
pseudoknot structure. 

Folded configurations are energetically somewhat optimal. Here energy means free energy, which is
dominated by the loops forming between adjacent base pairs and not by the hydrogen bonds of the individual
base pairs [3]. In addition sterical constraints imply certain minimum arc-length conditions for minimum
free energy configurations [1]. In particular, only configurations without isolated bonds and without bonds of
length one (formed by immediately subsequent nucleotides) are observed in RNA structures. In this paper,
to optimize a problem means to maximize the score rather than to minimize the free energy.

For a given RNA sequence polynomial-time dynamic programming (DP) algorithms can be devised that
find such minimal energy configurations. The most commonly used tools predicting simple RNA secondary
structure, mfold [5] and the Vienna RNA Package [6], run in $O(N^2)$ space and $O(N^3)$ time.
In the following we omit "simple" and refer to secondary structures containing crossing arcs as pseudoknot
structures.

Generalizing the matrices of the DP-routines of secondary structure folding [5][6] to gap-matrices [7] leads
to a DP-folding of pseudoknotted structures [7] (pknot-R&E) with $O(n^4)$ space and $O(n^6)$ time complexity.
The following references provide a certainly incomplete list of DP-approaches to RNA pseudoknot structure
prediction using various structure classes characterized in terms of recursion equations and/or stochastic
grammars: [7]-[19]. The most efficient algorithm for pseudoknot structures is [M] (pknotsRG), having $O(n^2)$
space and $O(n^4)$ time complexity. This algorithm however considers only a few types of pseudoknots.

RNA secondary structures are exactly structures of topological genus zero [20]. The topological
classification of RNA structures [21]-[23] has recently been translated into an efficient DP algorithm [19]. Fixing the
topological genus of RNA structures implies that there are only finitely many types, the so-called irreducible
shadows [23].

Sparsification is a method tailored to speed up DP-algorithms predicting mfe-secondary structures [24][25].
The idea is to prune certain computation paths encountered in the DP-recursions, see Figure 2. To make
the key point, let us consider the case of RNA secondary structure folding. Here sparsification reduces the
DP-recursion paths to be based on so-called candidates. A candidate is in this case an interval for which
the optimal solution cannot be written as a sum of optimal solutions of sub-intervals, see Figure 3. Tracing
back these candidates gives rise to "irreducible" structures and the crucial observation here is that these
irreducibles appear only at a low rate. This means that there are only relatively few candidates, which in
turn implies a significant reduction in time and space complexity.

Sparsification has been applied in the context of RNA-RNA interaction structures [55] as well as RNA
pseudoknot structures [27]. In contrast to RNA secondary structures, however, not every decomposition
rule in the DP-folding of RNA pseudoknot structures is amenable to sparsification. By construction,
sparsification can only be applied for calculating mfe-structures. Since the computation of the partition
function [T2J[1S] needs to take into account all sub-structures, sparsification does not work there.

Figure 2: Sparsification of secondary structure folding. Suppose the optimal solution $L_{i,j}$ is obtained from the
optimal solutions $L_{i,k}$, $L_{k+1,q}$ and $L_{q+1,j}$. Based on the recursions of the secondary structures, $L_{i,k}$ and $L_{k+1,q}$
produce an optimal solution of $L_{i,q}$. Similarly, $L_{k+1,q}$ and $L_{q+1,j}$ produce an optimal solution of $L_{k+1,j}$. Now, in
order to obtain an optimal solution of $L_{i,j}$ it is sufficient to consider either the grouping $L_{i,q}$ and $L_{q+1,j}$ or $L_{i,k}$ and
$L_{k+1,j}$.

Figure 3: What sparsification can and cannot prune: (A) and (B) are two computation paths yielding the same
optimal solution. Sparsification reduces the computation to path (A) where $S_{i,k}$ is irreducible. (C) is another
computation path with distinct leftmost irreducible over a different interval, hence representing a new candidate that
cannot be reduced to (A) by the sparsification.

For the mfe-folding of RNA secondary structures considerable attention has been paid to validating
that the set of candidates is small. The idea here is that irreducibles are contained in short, "rainbow"-like
arcs. To be precise, the gain is $O(n)$ if secondary structures satisfy the so-called polymer-zeta property [29][30]:
the latter requires the probability of an arc of length $m$ to be $\le b\,m^{-c}$, where $b > 0$, $c > 1$. Note that
in the case of secondary structures these arcs confine irreducible structures, that is, arcs and irreducibility are
tightly connected.

In pseudoknotted RNA structures however, we have crossing arcs and the associated notion of irreducible
structures differs significantly from that of RNA secondary structures. The polymer-zeta property is
theoretically justified by means of modeling the 2D folding of a polymer chain as a self-avoiding walk (SAW)
in a 2D lattice [31]. More evidence of the polymer-zeta property for RNA secondary structures has been
collected via the NCBI database [35] of mfe-RNA structures.

In this paper we study the sparsification of the decomposition rule $\Lambda^*$ that splices an interval [25][27] in
the context of the DP-folding of RNA pseudoknot structures of fixed topological genus. Our paper provides
a combinatorial framework to quantify the effects of sparsifying the $\Lambda^*$-decomposition rule.

We shall prove that the candidate set [24][25][27] is indeed small. Our argument is based on assuming a
specific distribution of irreducible structures within mfe-structures. Namely we assume these irreducibles to
appear with probability $f^*(n,j)/f(n,j)$, where we assume $e$ to be a fixed parameter and
$F(z,e) = \sum f_{n,j}\, z^n e^j$ to be a bivariate (energy-filtered) generating function whose associated generating
function of irreducibles is $F^*(z,e) = \sum f^*_{n,j}\, z^n e^j$.

While this energy-filtration seems to be a reparameterization of the notion of "stickiness" [33], it is really
fundamentally different. This becomes clear when considering loop-based energies, which distinguish energy
and arcs. Clearly when folding random sequences one weights the latter around 6/16, reminiscent of the
probability of two given positions to be compatible. The energy however is fairly independent as it really
depends on the particular loop-type.

We obtain these energy-filtered GFs also for RNA pseudoknot structures of fixed topological genus. This
provides new insights into the improvements of the sparsification of the concatenation-rule $\Lambda^*$ in the presence
of cross-serial interactions. Our observations complement the detailed analysis of Backofen [25][27]. We show
that although for pseudoknot structures of fixed topological genus [22][23] the effect of sparsification on
the global time complexity is still unclear, the decomposition rule that splits an interval can be sped up
significantly.

Sparsification 

The general idea of sparsification [24][25][27] is the following: let $V = \{v_1, v_2, \dots\}$ be a set whose elements $v_i$ are
unions of pairwise disjoint intervals. Let furthermore $L_v$ denote an optimal solution (a positive number or
score) of the DP-routine over $v$. By assumption $L_v$ is recursively obtained. Suppose the optimal solution $L_v$
is given by $L_v = L_{v_1} + L_{v_2} + L_{v_3}$, where $v = v_1 \cup v_2 \cup v_3$. Then, under certain circumstances, the DP-routine
may interpret $L_v$ either as $(L_{v_1} + L_{v_2}) + L_{v_3}$ or as $L_{v_1} + (L_{v_2} + L_{v_3})$, see Figure 3. To be precise, this
situation is encountered iff

• there exists an optimal solution $L_{v'_1}$ for a sub-structure over $v'_1$ where $v'_1 = v_1 \cup v_2$ via $\Lambda_2$ and $L_v$ is
obtained from $L_{v'_1}$ and $L_{v_3}$ via $\Lambda_1$,

• there exists an optimal solution $L_{v'_2}$ for a sub-structure over $v'_2$ where $v'_2 = v_2 \cup v_3$ via $\Lambda_3$ and $L_v$ is
obtained from $L_{v_1}$ and $L_{v'_2}$ via $\Lambda_1$.
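Given a decomposition in which $\Lambda_1$ splits $v$ into $v'_1$ and $v_3$ and $\Lambda_2$ splits $v'_1$ into $v_1$ and $v_2$,
we call $\Lambda_2$ s-compatible to $\Lambda_1$ if there exists a decomposition rule $\Lambda_3$ such that $\Lambda_1$ splits $v$ into
$v_1$ and $v'_2$ and $\Lambda_3$ splits $v'_2$ into $v_2$ and $v_3$.
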

Note that if $\Lambda_2$ is s-compatible to $\Lambda_1$ then $\Lambda_3$ is s-compatible to $\Lambda_1$. To summarize

Definition 1. (s-compatible) Suppose $L_v$ is the optimal solution for $S_v$ over $v$, $L_v = L_{v'_1} + L_{v_3}$ under
decomposition rule $\Lambda_1$. $L_{v'_1}$ is obtained from two optimal solutions $L_{v_1}$ and $L_{v_2}$ under rule $\Lambda_2$. Then $\Lambda_2$ is
called s-compatible to $\Lambda_1$ if there exists some rule $\Lambda_3$ such that $L_{v'_2} = L_{v_2} + L_{v_3}$ and $L_v = L_{v_1} + L_{v'_2}$.





Figure 4: Sparsification: $L_v$ is alternatively realized via $L_{v_1}$ and $L_{v'_2}$ or $L_{v'_1}$ and $L_{v_3}$. Thus it is sufficient to only
consider one of the computation paths.

Figure 4 depicts two such ways that realize the same optimal solution $L_v$. Sparsification prunes any such
multiple computations of the same optimal value.

We next come to the important concept of candidates. The latter mark the essential computation paths 
for the DP-routine. 

Definition 2. (Candidates) Suppose $L_v$ is an optimal solution. We call $v$ a $\Lambda$-candidate if for any $v_1 \subset v$
obtained by $\Lambda$ and $v = v_1 \cup v_2$, we have

\[ L_v > L_{v_1} + L_{v_2} \]

and we shall denote the set of $\Lambda$-candidates by $Q_\Lambda$.

Lemma 1. [24][27] Suppose $\Lambda_2$ is s-compatible to $\Lambda_1$, then any optimal solution $L_v$ can be obtained via
$\Lambda_2$-candidates.
By construction a $\Lambda_2$-candidate $v$ is a union of disjoint intervals such that its optimal solution $L_v$ cannot
be obtained via a $\Lambda_2$-splitting. This optimal solution allows one to construct a non-unique arc-configuration
(sub-structure) over $v$ and the above $\Lambda_2$-splitting consequently translates into a splitting of this sub-structure.
This connects the notion of $\Lambda_2$-candidates with that of sub-structures and shows that a $\Lambda_2$-candidate implies
a sub-structure that is $\Lambda_2$-irreducible.

In the case of sparsification of RNA secondary structures we have one basic decomposition rule $\Lambda^*$ acting
on intervals, namely $\Lambda^*$ splices an interval into two disjoint, subsequent intervals. The implied notion of a
$\Lambda^*$-irreducible sub-structure is that of a sub-structure nested in a maximal arc, where maximal refers to
the partial order $(i,j) \le (i',j')$ iff $i' \le i \wedge j \le j'$. This observation relates irreducibility to that of arcs and,
following this line of thought, [24] identifies a specific property of polymer-chains introduced in [29][30] to be
of relevance for the size of candidate sets:

Definition 3. (Polymer-zeta property) Let $P(i,j)$ denote the probability of a structure over an interval $[i,j]$
of length $m = j - i + 1$ under some decomposition rule $\Lambda$. Then we say $\Lambda$ follows the polymer-zeta property
if $P(i,j) = b\,m^{-c}$ for some constants $b, c > 0$.

This property is theoretically justified by means of modeling the 2D folding of a polymer chain as a
self-avoiding walk (SAW) in a 2D lattice [31].

RNA secondary structures 

In this section we recall some results of [24][25] on the sparsification of RNA secondary structures. Secondary
structures satisfy a simple recursion which gives the optimal solution over $[i,j]$ by $L_{i,j} = \max\{V_{i,j}, W_{i,j}\}$,
where $V_{i,j}$ denotes the optimal solution in which $(i,j)$ is a base pair, and $W_{i,j}$ denotes the optimal solution
obtained by adding the optimal solutions of two subsequent intervals, respectively. Note that the optimal
solution over a single vertex is denoted by $L_{i,i}$. We have the recursion equations for $V_{i,j}$ and $W_{i,j}$:

\[ (\Lambda_1)\quad V_{i,j} = L_{i+1,j-1} + f(i,j), \]

\[ (\Lambda_2)\quad W_{i,j} = \max_{i \le k < j} \{L_{i,k} + L_{k+1,j}\}, \]

where $f(i,j)$ is the score when $(i,j)$ form a base pair, see Figure 5. In case two positions $i,j$ in the sequence
are incompatible then we have $f(i,j) = -\infty$.

An interval $[i,j]$ is a $\Lambda^*$-candidate if the optimal solution over $[i,j]$ is given by $L_{i,j} = V_{i,j} > W_{i,j}$. Indeed,
$[i,j]$ is a candidate iff $[i,j]$ is in the candidate set of $\Lambda^*$, and we denote the set by $Q$. Suppose the
optimal solution $W_{i,j}$ is given by $W_{i,j} = L_{i,q} + L_{q+1,j}$ and suppose we have $L_{i,q} = L_{i,k} + L_{k+1,q}$. Then since
$[i,q]$ is not a candidate, Lemma 1 shows that we can compute $W_{i,j} = L_{i,k} + L_{k+1,j}$, where $[i,k]$ is a candidate.

Figure 5: The recursion solving the optimal solution for secondary structures.

Accordingly, the recursion for $W_{i,j}$ can be based on candidates, i.e. $W_{i,j} = \max_{[i,k] \in Q}\{L_{i,k} + L_{k+1,j}\}$.
Clearly, the bottleneck for computing the recursion is the calculation of $W_{i,j}$, which requires $O(n^3)$ time.
Applying sparsification, this recursion is based on candidates $[i,k]$. Suppose we have $Z$ such candidates,
then the time complexity reduces to $O(nZ)$, since the optimal solution is necessarily based on a candidate.
Once the latter is identified the expression $L_{k+1,j}$ requires only $O(n)$ time complexity. In the worst case, $Q$
contains $O(n^2)$ elements.

The polymer-zeta property however implies that the expectation of $Z$ is given by $\sum_{i \ge 1} b\,(n-i)\,i^{-c}$,
where $b$ and $c$ are constants and $c > 1$. We can conclude from the polymer-zeta property that $Z = O(n)$
and accordingly the runtime reduces to $O(n) \cdot O(n) = O(n^2)$.
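
The candidate-based recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the score `pair_score` is a hypothetical unit score per compatible pair (1-arcs excluded), indices are 0-based, and the sparse variant treats "position $i$ unpaired" as a separate case next to the candidates $[i,k]$, a common way of organizing the split. The names `fold_dense` and `fold_sparse` are introduced here for illustration only.

```python
import random

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
NEG_INF = float("-inf")


def pair_score(seq, i, j):
    """f(i, j): 1 for a compatible pair that is not a 1-arc, -infinity otherwise (assumed score)."""
    if j - i >= 2 and (seq[i], seq[j]) in PAIRS:
        return 1.0
    return NEG_INF


def fold_dense(seq):
    """L[i][j] = max(V, W), where W is maximised over every split point k."""
    n = len(seq)
    L = [[0.0] * n for _ in range(n)]  # entries below the diagonal stand for empty intervals
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            v = L[i + 1][j - 1] + pair_score(seq, i, j)
            w = max(L[i][k] + L[k + 1][j] for k in range(i, j))
            L[i][j] = max(v, w)
    return L


def fold_sparse(seq):
    """Same optimum, but W only inspects split points k for which [i, k] is a candidate."""
    n = len(seq)
    L = [[0.0] * n for _ in range(n)]
    candidates = [[] for _ in range(n)]  # candidates[i] lists k such that [i, k] is a candidate
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            v = L[i + 1][j - 1] + pair_score(seq, i, j)
            w = L[i + 1][j]  # position i stays unpaired
            for k in candidates[i]:
                w = max(w, L[i][k] + L[k + 1][j])
            if v > w:  # the optimum over [i, j] cannot be written as a sum: new candidate
                candidates[i].append(j)
            L[i][j] = max(v, w)
    return L, candidates
```

Summing `len(c)` over `candidates` gives the quantity $Z$ discussed above for one input sequence; comparing it with $n(n-1)/2$ shows how many split points the sparse variant never inspects.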

RNA pseudoknot structures 

Sparsification can also be applied to the DP-algorithm folding RNA structures with pseudoknots [22]. In
contrast to the decomposition rule $\Lambda^*$ that splices an interval into two subsequent intervals, we encounter in
the grammar for pseudoknotted structures additional, more complex decomposition rules [7]. As shown in [27]
there exist some decomposition rules which are not s-compatible and which can accordingly not be sparsified
at all, see Figure 6. For instance, given a decomposition rule $\Lambda$ in pknot-R&E, subsequent decomposition
rules which are s-compatible to $\Lambda$ are referred to as split type of $\Lambda$ [27].

In the following we will study RNA pseudoknot structures of fixed topological genus, see Section
Diagrams, surfaces and some generating functions for details. An algorithm folding such pseudoknot
structures, gfold, has been presented in [19]. The decomposition rules that appear in gfold are reminiscent
of those of pknot-R&E but as they restrict the genus of sub-structures the iteration of gap-matrices is severely
restricted and the effect of sparsification of these decompositions is significantly smaller.




Figure 6: Decomposition rules for pseudoknot structures of fixed genus. (A) three decompositions via the rule $\Lambda^*$,
which is s-compatible to itself. We show that for $\Lambda^*$ we obtain a linear reduction in time complexity. (B) three
decomposition rules $\Lambda_1, \Lambda_2, \Lambda_3$ where $\Lambda_2, \Lambda_3$ are s-compatible to $\Lambda_1$. A quantification of the candidate set is not
implied by the polymer-zeta property. (C) three decomposition rules $\Lambda_1, \Lambda_2, \Lambda_3$ where $\Lambda_2, \Lambda_3$ are not s-compatible
to $\Lambda_1$.

In the following, we restrict our analysis to the decomposition rule $\Lambda^*$ which splices an interval into two
subsequent intervals. Expressed in combinatorial language, $\Lambda^*$ cuts the backbone of an RNA pseudoknot
structure of fixed genus $g$ over one interval without cutting a bond.

Methods 

Diagrams and genus filtration 

In this section we recall some facts about diagrams and pass from diagrams to surfaces in order to be able to
formulate what we mean by an RNA pseudoknot structure of fixed genus $g$. Most of this section is derived
from [23][33] with the exception of Lemma 2 and Theorem 2, which are new and key for the subsequent
analysis of $\Lambda^*$-candidates.

A diagram is a labeled graph over the vertex set $[n] = \{1, \dots, n\}$ in which each vertex has degree $\le 3$,
represented by drawing its vertices in a horizontal line. The backbone of a diagram is the sequence of
consecutive integers $(1, \dots, n)$ together with the edges $\{\{i, i+1\} \mid 1 \le i \le n-1\}$. The arcs of a diagram,
$(i,j)$, where $i < j$, are drawn in the upper half-plane. We shall distinguish the backbone edge $\{i, i+1\}$ from
the arc $(i, i+1)$, which we refer to as a 1-arc. A stack of length $\ell$ is a maximal sequence of "parallel" arcs,
$((i,j), (i+1, j-1), \dots, (i+(\ell-1), j-(\ell-1)))$ and is also referred to as an $\ell$-stack, see Figure 7.

We shall consider diagrams as fatgraphs, $G$, that is graphs $G$ together with a collection of cyclic orderings,
called fattenings, one such ordering on the half-edges incident on each vertex. Each fatgraph $G$ determines an
oriented surface $F(G)$ [551155] which is connected if $G$ is and has some associated genus $g(G) \ge 0$ and number
$r(G) \ge 1$ of boundary components. Clearly, $F(G)$ contains $G$ as a deformation retract [37]. Fatgraphs were
first applied to RNA secondary structures in [35] and [39].

Figure 7: RNA structures and diagram representation. A diagram over $\{1, \dots, 40\}$. The arcs (1,21) and (11,33)
are crossing and the dashed arc (9,10) is a 1-arc which is not allowed. This structure contains 3 stacks with length
7, 4 and 6, from left to right respectively.

A diagram $G$ hence determines a unique surface $F(G)$ (with boundary). Filling the boundary components
with discs we can pass from $F(G)$ to a surface without boundary. Euler characteristic, $\chi$, and genus, $g$, of
this surface are given by $\chi = v - e + r$ and $g = 1 - \frac{1}{2}\chi$, respectively, where $v, e, r$ is the number of discs,
ribbons and boundary components in $G$ [37]. The genus of a diagram is that of its associated surface without
boundary and a diagram of genus $g$ is referred to as a $g$-diagram.
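
The genus formula above can be turned into a short computation. The sketch below is illustrative and rests on one simplifying assumption that is not spelled out in the text: the backbone is collapsed into a single disc ($v = 1$), so the ribbons are exactly the arcs and the boundary components are the cycles of the map "jump across the arc, then step to the next arc endpoint". Isolated vertices do not influence the genus and are ignored. The function name `genus` is hypothetical.

```python
def genus(arcs):
    """Genus of a diagram given by its arcs (i, j); every position carries at most one arc."""
    endpoints = sorted(p for arc in arcs for p in arc)
    m = len(endpoints)
    if m == 0:
        return 0
    rank = {p: r for r, p in enumerate(endpoints)}  # relabel arc endpoints as 0..m-1
    partner = [0] * m
    for i, j in arcs:
        partner[rank[i]], partner[rank[j]] = rank[j], rank[i]
    # Boundary components: cycles of x -> (partner(x) + 1) mod m.
    seen = [False] * m
    boundaries = 0
    for start in range(m):
        if seen[start]:
            continue
        boundaries += 1
        x = start
        while not seen[x]:
            seen[x] = True
            x = (partner[x] + 1) % m
    chi = 1 - len(arcs) + boundaries  # chi = v - e + r with a single disc
    return (2 - chi) // 2             # g = 1 - chi / 2
```

For instance, two nested arcs give genus 0, while two crossing arcs (an H-type pseudoknot) or a kissing-hairpin pattern give genus 1.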

A $g$-diagram without arcs of the form $(i, i+1)$ (1-arcs) is called a $g$-structure. A $g$-diagram that contains
only vertices of degree three, i.e. does not contain any vertices not incident to arcs in the upper halfplane,
is called a $g$-matching. A stack of length $r$ is a maximal sequence of "parallel" arcs,
$((i,j), (i+1, j-1), \dots, (i+(r-1), j-(r-1)))$.

A diagram is called irreducible, if and only if it cannot be split into two by cutting the backbone without 
cutting an arc. 
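
As a small illustration of this definition, the following sketch cuts the backbone at every position where no arc is crossed and returns the resulting blocks; a diagram is irreducible exactly when a single block is returned. The function name, the 1-based positions and the convention that an isolated vertex forms a block of its own are assumptions made for this example.

```python
def irreducible_blocks(n, arcs):
    """Split the backbone 1..n into maximal blocks separable without cutting an arc."""
    reach = list(range(n + 1))  # reach[i] = right endpoint of the arc starting at i, else i
    for i, j in arcs:
        reach[i] = j
    blocks, start, right = [], 1, 0
    for pos in range(1, n + 1):
        right = max(right, reach[pos])
        if pos == right:  # no arc spans the backbone edge {pos, pos + 1}
            blocks.append((start, pos))
            start = pos + 1
    return blocks
```

With the hypothetical arcs `[(1, 25), (28, 40)]` over 40 vertices the blocks are `(1, 25)`, `(26, 26)`, `(27, 27)` and `(28, 40)`, so the diagram is reducible, whereas two crossing arcs over four vertices form a single block.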

Let $c_g(n)$ and $d_g(n)$ denote the number of $g$-matchings and $g$-structures having $n$ arcs and $n$ vertices,
respectively, with GF

\[ C_g(z) = \sum_{n \ge 0} c_g(n)\, z^n, \qquad D_g(z) = \sum_{n \ge 0} d_g(n)\, z^n. \]

The GF $C_g(z)$ has been computed in the context of the virtual Euler characteristic of the moduli-space of curves
in [34] and $D_g(z)$ can be derived from $C_g(z)$ by means of symbolic enumeration [23]. The GF of genus zero
diagrams $C_0(z)$ is well known to be the GF of the Catalan numbers, i.e., the numbers of triangulations of a
polygon with $(n+2)$ sides.



\[ C_0(z) = \frac{1 - \sqrt{1 - 4z}}{2z}. \]

As for $g \ge 1$ we have the following situation [23]



Theorem 1. Suppose $g \ge 1$. Then the following assertions hold

(a) $D_g(z)$ is algebraic and

\[ D_g(z) = \frac{1}{z^2 - z + 1}\, C_g\!\left(\frac{z^2}{(z^2 - z + 1)^2}\right). \tag{1} \]

In particular, we have for some constant $a_g$ depending only on $g$ and $\gamma \approx 2.618$:

\[ [z^n] D_g(z) \sim a_g\, n^{3(g - \frac{1}{2})}\, \gamma^n. \tag{2} \]

(b) the bivariate GF of $g$-structures over $n$ vertices, containing exactly $m$ arcs, $E_g(z,t)$, is given by

\[ E_g(z,t) = \frac{1}{t z^2 - z + 1}\, C_g\!\left(\frac{t z^2}{(t z^2 - z + 1)^2}\right). \]



Irreducible g-structures 

In the context of $\Lambda^*$-candidates we observed that irreducible substructures are of key importance. It is
accordingly of relevance to understand the combinatorics of these structures. To this end let
$D^*_g(z) = \sum_{n \ge 0} d^*_g(n)\, z^n$ denote the GF of irreducible $g$-structures.

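Lemma 2. For $g \ge 0$, the GF $D^*_g(z)$ satisfies the recursion

\[ D^*_0(z) = 1 - \frac{1}{D_0(z)}, \]

\[ D^*_g(z) = -\,\frac{(D^*_0(z) - 1)\, D_g(z) + \sum_{g_1 = 1}^{g-1} D^*_{g_1}(z)\, D_{g - g_1}(z)}{D_0(z)}, \qquad g \ge 1. \]

For a proof of Lemma 2 see Section Proofs.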
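
The genus-zero case of this recursion can be checked on coefficients. The sketch below assumes that the case $g = 0$ reads $D^*_0(z) = 1 - 1/D_0(z)$, i.e. $D_0(z)\,(1 - D^*_0(z)) = 1$, and that $d_0(n)$ counts secondary structures without 1-arcs; both function names are hypothetical. It computes $d_0(n)$ by the classical "last vertex unpaired or paired" recurrence and then inverts the power series term by term.

```python
def secondary_structure_counts(n_max):
    """d[n] = number of secondary structures (no crossings, no 1-arcs) over n vertices."""
    d = [1] * (n_max + 1)
    for n in range(3, n_max + 1):
        # vertex n is unpaired, or paired with vertex i and encloses at least one vertex
        d[n] = d[n - 1] + sum(d[i - 1] * d[n - i - 1] for i in range(1, n - 1))
    return d


def irreducible_counts(d):
    """Coefficients of D*(z) = 1 - 1/D(z), obtained from D(z) * (1 - D*(z)) = 1."""
    star = [0] * len(d)
    for n in range(1, len(d)):
        star[n] = d[n] - sum(star[k] * d[n - k] for k in range(1, n))
    return star
```

Two observations follow from running it: the coefficient of $z$ in $D^*_0(z)$ is 1, so in this convention a single vertex counts as irreducible, and for $n \ge 3$ one finds $d^*_0(n) = d_0(n-2)$, reflecting that an irreducible secondary structure consists of the arc $(1,n)$ together with an arbitrary secondary structure nested inside.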
Theorem 2. For $g \ge 1$ we have

(a) the GF of irreducible $g$-structures over $n$ vertices is given by

where $u = \frac{z^2}{(z^2 - z + 1)^2}$, $U_g(z)$ and $V_g(z)$ are both polynomials with lowest degree at least $2g$, and
$U_g(1/4), V_g(1/4) \neq 0$. In particular, for some constant $k_g > 0$ and $\gamma \approx 2.618$:

\[ d^*_g(n) \sim k_g\, n^{3(g - \frac{1}{2})}\, \gamma^n. \tag{5} \]

(b) the bivariate GF of irreducible $g$-structures over $n$ vertices, containing exactly $m$ arcs, $E^*_g(z,t)$, is given
by

where $v = \frac{t z^2}{(t z^2 - z + 1)^2}$.

We shall postpone the proof of Theorem 2 to Section Proofs.
The main result

In Section Sparsification we observed that sparsification applies to the decomposition rule $\Lambda^*$, which
effectively splices off an irreducible sub-structure (diagram). This notion of $\Lambda^*$-irreducibility is indeed compatible
with the notion of combinatorial irreducibility introduced in Section Diagrams, surfaces and some
generating functions, see Figure 8. An optimal solution for the original structure is obtained from an optimal
solution of the spliced, $\Lambda^*$-irreducible, sub-structure and an optimal solution for the remaining sub-structure.




Figure 8: Irreducibility relative to a decomposition rule: the rule $\Lambda^*$ splitting $S_{i,j}$ to $S_{i,k}$ and $S_{k+1,j}$, $S_{1,40}$ is not
$\Lambda^*$-irreducible, while $S_{1,25}$ and $S_{28,40}$ are. However, for the decomposition rule $\Lambda_2$, which removes the outmost arc,
$S_{28,40}$ is not $\Lambda_2$-irreducible while $S_{1,25}$ is.

Folded configurations are energetically optimal and dominated by the stacking of adjacent base pairs [3],
as well as minimum arc-length conditions [4] discussed before.

In the following we mimic some form of minimum free energy $g$-structures: inspired by the Nussinov
energy model [40] we consider the weight of a $g$-structure over $n$ vertices to be given by $\eta^{\ell}$, where $\ell$ is the
number of arcs, for some $\eta \ge 1$ [33]. Note that the case $\eta = 1$ corresponds to the uniform distribution, i.e. all
$g$-structures have identical weight.

This approach requires keeping track of the number of arcs, i.e. we need to employ bivariate GFs. In
Theorem 1 (b) we computed this bivariate GF and in Theorem 2 (b) we derived from this bivariate GF
$E^*_g(z,t)$, the GF of irreducible $g$-structures over $n$ vertices containing $\ell$ arcs.

The idea now is to substitute for the second indeterminate, $t$, some fixed $\eta \in \mathbb{R}$. This substitution induces
the formal power series

\[ D_{g,\eta}(z) = E_g(z,\eta), \]

which we regard as being parameterized by $\eta$. Obviously, setting $\eta = 1$ we recover $D_g(z)$, i.e. we have
$D_g(z) = D_{g,1}(z) = E_g(z,1)$. Note that for $\eta > 1/4$, the polynomial $\eta z^2 - z + 1$ has no real root. Thus we
have for $\eta > 1/4$ the asymptotics

\[ d_{g,\eta}(n) \sim a_{g,\eta}\, n^{3(g - \frac{1}{2})}\, \gamma_\eta^{\,n}, \qquad d^*_{g,\eta}(n) \sim k_{g,\eta}\, n^{3(g - \frac{1}{2})}\, \gamma_\eta^{\,n}, \tag{7} \]
with identical exponential growth rates as long as the supercritical paradigm [41] applies, i.e. as long as $\gamma_\eta^{-1}$,
the real root of minimal modulus of

\[ \frac{\eta z^2}{(\eta z^2 - z + 1)^2} = \frac{1}{4}, \]

is smaller than any singularity of $\frac{\eta z^2}{(\eta z^2 - z + 1)^2}$. In this situation $\eta$ affects the constant $a_{g,\eta}$ and the exponential
growth rate $\gamma_\eta$ but not the sub-exponential factor $n^{3(g - \frac{1}{2})}$. The latter stems from the singular expansion
of $C_g(z)$. Analogously, we derive the $\eta$-parameterized family of GF $D^*_{g,\eta}(z) = E^*_g(z,\eta)$. Assuming a
random sequence has on average a probability at most 6/16 to form a base pair we fix in the following
$\eta = 6e/16 \approx 1.0194$, where $e$ is the Euler number. By abuse of notation we will omit the subscript $\eta$
assuming $\eta = 6e/16$.
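
The dependence of the exponential growth rate on $\eta$ can be made explicit. The sketch below is a hedged illustration: it assumes that the supercritical paradigm applies, so that the dominant singularity is the smallest positive root $\rho$ of $\eta z^2/(\eta z^2 - z + 1)^2 = 1/4$. For $z > 0$ and $\eta > 1/4$ this equation is equivalent to $\eta z^2 - (1 + 2\sqrt{\eta})\,z + 1 = 0$, whose discriminant simplifies to $1 + 4\sqrt{\eta}$. The function name `growth_rate` is introduced here for illustration.

```python
import math


def growth_rate(eta):
    """gamma_eta = 1 / rho, rho being the smallest positive root of
    eta * z**2 / (eta * z**2 - z + 1)**2 = 1/4 (assumes eta > 1/4)."""
    if eta <= 0.25:
        raise ValueError("this sketch assumes eta > 1/4")
    s = math.sqrt(eta)
    # 2 * sqrt(eta) * z = eta * z**2 - z + 1  <=>  eta * z**2 - (1 + 2 s) * z + 1 = 0
    rho = ((1 + 2 * s) - math.sqrt(1 + 4 * s)) / (2 * eta)
    return 1 / rho
```

For $\eta = 1$ this reproduces $\gamma = (3 + \sqrt{5})/2 \approx 2.618$ from Theorem 1; for $\eta = 6e/16$ the function returns a slightly larger growth rate, while the sub-exponential factor is unaffected.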

The main result of this section is that the set of $\Lambda^*$-candidates is small. To put this size into context we
note that the total number of entries considered for the $\Lambda^*$-decomposition rule is given by

\[ \Omega(n) = \sum_{m=1}^{n} (n - m + 1). \]

Theorem 3. Suppose an mfe $g$-structure over an interval of length $m$ is irreducible with probability
$d^*_g(m)/d_g(m)$, then the expected number of candidates of $g$-structures for sequences of length $n$ satisfies

\[ \mathbb{E}_g(n) = \Theta(n^2) \]

and furthermore, setting $\epsilon_g(n) = \mathbb{E}_g(n)/\Omega(n)$ we have

\[ \epsilon_g(n) \sim d^*_g(n)/d_g(n) \sim b_g, \]

where $b_g > 0$ is a constant.

We provide an illustration of Theorem 3 in Figure 9.

Proof. We prove the theorem by quantifying the probability of $[i,j]$ being a $\Lambda^*$-candidate. In this case any
(not necessarily unique) sub-structure, realizing the optimal solution $L_{i,j}$, is $\Lambda^*$-irreducible, and therefore
an irreducible structure over $[i,j]$.

Let $m = (j - i + 1)$. By assumption, the probability that $[i,j]$ is a candidate conditional to the existence
of a substructure over $[i,j]$ is given by

\[ \mathbb{P}_g([i,j] \mid [i,j] \text{ is a candidate}) = \frac{d^*_g(m)}{d_g(m)}. \tag{8} \]

Figure 9: The expected number of candidates for secondary and 1-structures, $\mathbb{E}_0(n)$ and $\mathbb{E}_1(n)$, as functions
of the sequence length $n$: we compute the expected number of candidates obtained by folding 100 random sequences
for secondary structures (A) (solid) and 1-structures (B) (solid). We also display the theoretical expectations implied
by Theorem 3 (A) (dashed) and (B) (dashed).



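Note that $\mathbb{P}_g([i,j] \mid [i,j] \text{ is a candidate})$ does not depend on the relative location of the interval but only on
the interval-length. Let $\mathbb{P}_g(m) = d^*_g(m)/d_g(m)$, then according to Theorem 2

\[ (1-\epsilon)\, a_g\, m^{3(g - \frac{1}{2})}\, \gamma^m \le d_g(m) \le (1+\epsilon)\, a_g\, m^{3(g - \frac{1}{2})}\, \gamma^m, \]

\[ (1-\epsilon)\, k_g\, m^{3(g - \frac{1}{2})}\, \gamma^m \le d^*_g(m) \le (1+\epsilon)\, k_g\, m^{3(g - \frac{1}{2})}\, \gamma^m, \]

for $m \ge m_0$ where $m_0 > 0$ and $0 < \epsilon < 1$ are constants. On the one hand

\[ \mathbb{P}_g(m) = \frac{d^*_g(m)}{d_g(m)} \le \frac{(1+\epsilon)\, k_g\, m^{3(g - \frac{1}{2})}\, \gamma^m}{(1-\epsilon)\, a_g\, m^{3(g - \frac{1}{2})}\, \gamma^m} = (1+\epsilon')\,\frac{k_g}{a_g} = (1+\epsilon')\, b_g, \tag{9} \]

where $b_g = k_g/a_g > 0$ is a constant. On the other hand, we have

\[ \mathbb{P}_g(m) = \frac{d^*_g(m)}{d_g(m)} \ge \frac{(1-\epsilon)\, k_g\, m^{3(g - \frac{1}{2})}\, \gamma^m}{(1+\epsilon)\, a_g\, m^{3(g - \frac{1}{2})}\, \gamma^m} = (1-\epsilon'')\, b_g. \tag{10} \]

Setting $\epsilon = \max\{\epsilon', \epsilon''\}$, we can conclude that $\mathbb{P}_g(m) = d^*_g(m)/d_g(m) \sim b_g$, see Fig. 10.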

We next study the expected number of candidates over an interval of length $m$. To this end let

\[ X_m = |\{[i,j] \mid [i,j] \text{ is a } \Lambda^*\text{-candidate of length } m\}|. \]

The expected cardinality of the set of $\Lambda^*$-candidates of length $m = (j - i + 1)$ encountered in the DP-algorithm
is given by

\[ \mathbb{E}_g(X_m) \le (n - (m-1))\, \mathbb{P}_g(m), \]

Figure 10: The probability distribution of $\mathbb{P}_0(m)$ (A) and $\mathbb{P}_1(m)$ (B).

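since there are $n - (m-1)$ starting points for such an interval. For sufficiently large $m \ge m_0$ we have
$\mathbb{P}_g(m) \le (1+\epsilon)\, b_g$ with $\epsilon$ being a small constant. Therefore, by linearity of expectation, we have

\[ \mathbb{E}_g(n) = \mathbb{E}_g\Big(\sum_{m} X_m\Big) \le \sum_{m=1}^{m_0 - 1} (n - m + 1)\, \mathbb{P}_g(m) + (1+\epsilon)\, b_g \sum_{m=m_0}^{n} (n - m + 1). \tag{11} \]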


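Consequently, the expected size of the $\Lambda^*$-candidate set is $\Theta(n^2)$. We proceed by comparing the expected
number of candidates of a sequence with length $n$ with $\Omega(n)$:

\[ \frac{\mathbb{E}_g(n)}{\Omega(n)} \le \frac{\sum_{m=1}^{m_0-1} (n - m + 1)\, \mathbb{P}_g(m) + (1+\epsilon)\, b_g \sum_{m=m_0}^{n} (n - m + 1)}{\sum_{m=1}^{n} (n - m + 1)} \]

\[ = (1+\epsilon)\, b_g + \frac{\sum_{m=1}^{m_0-1} \big(\mathbb{P}_g(m) - (1+\epsilon)\, b_g\big)(n - m + 1)}{\sum_{m=1}^{n} (n - m + 1)} \]

\[ \le (1+\epsilon)\, b_g + \frac{k \cdot n}{\Omega(n)}, \]

for some constant $k$. For sufficiently large $n \ge n_0$, $\mathbb{E}_g(n)/\Omega(n) \le (1+\epsilon')\, b_g$. Furthermore

\[ \frac{\mathbb{E}_g(n)}{\Omega(n)} \ge (1-\epsilon')\, b_g, \]

from which we can conclude $\mathbb{E}_g(n)/\Omega(n) \sim d^*_g(m)/d_g(m) \sim b_g$ and the theorem is proved.

□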
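
To see the statement of Theorem 3 at work, the following sketch evaluates $\sum_{m}(n-m+1)\,\mathbb{P}_0(m)/\Omega(n)$ numerically in a deliberately simplified setting: genus zero with uniform weights ($\eta = 1$), not the energy-weighted case $\eta = 6e/16$ used in the text. It relies on the fact that an irreducible secondary structure over $m \ge 3$ vertices is the arc $(1,m)$ with an arbitrary secondary structure nested inside, so that $d^*_0(m) = d_0(m-2)$; counting a single vertex as irreducible and setting the probability for $m = 2$ to zero are conventions assumed for this example, and the function names are hypothetical.

```python
def structure_counts(n_max):
    """d[n] = number of secondary structures (no 1-arcs) over n vertices."""
    d = [1] * (n_max + 1)
    for n in range(3, n_max + 1):
        d[n] = d[n - 1] + sum(d[i - 1] * d[n - i - 1] for i in range(1, n - 1))
    return d


def expected_candidate_fraction(n, d):
    """Sum over m of (n - m + 1) * P(m), divided by Omega(n) = n (n + 1) / 2."""
    def p_irreducible(m):
        if m == 1:
            return 1.0  # a single vertex counts as irreducible (assumed convention)
        if m == 2:
            return 0.0  # the only arc over two vertices would be a 1-arc
        return d[m - 2] / d[m]  # d*(m) / d(m) with d*(m) = d(m - 2)

    omega = n * (n + 1) // 2
    expected = sum((n - m + 1) * p_irreducible(m) for m in range(1, n + 1))
    return expected / omega
```

In this uniform setting the ratio $d_0(m-2)/d_0(m)$ tends to $\gamma^{-2} \approx 0.146$ with $\gamma \approx 2.618$, and the computed fraction approaches this constant from above as $n$ grows: a constant-factor reduction rather than a reduction in the order of growth.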



Loop-based energies 

In this section we discuss the more realistic loop-based energy model of RNA secondary structure folding. To
be precise we invoke here instead two trivariate GFs $F(z,t,v)$ and $F^*(z,t,v)$ counting secondary structures
over $n$ vertices that filter energy and arcs.

This becomes necessary since the loop-based model distinguishes between arcs and energy. The
"cancelation" effect or reparameterization of stickiness [33] to which we referred before does not appear in this
context. Thus we need both an arc- as well as an energy-filtration.

A further complication emerges. In contrast to the GFs $E_g(z,t)$ and $E^*_g(z,t)$ the new GFs are not
simply obtained by formally substituting $t z^2/(t z^2 - z + 1)^2$ into the power series $D_g(z)$ and $D^*_g(z)$ as
bivariate terms. The more complicated energy model requires a specific recursion for irreducible secondary
structures.

The energy model used in predicting secondary structure is more complicated than the simple arc-based
energy model. Loops, which are formed by arcs as well as isolated vertices between the arcs, are considered
to give energy contributions. Loops are categorized as hairpin loops (no nested arcs), interior loops (including
bulge loops and stacks) and multi-loops (more than two arcs nested), see Figure 11. An arbitrary secondary
structure can be uniquely decomposed into a collection of mutually disjoint loops. A result of the particular
energy parameters [J is that the energy model prefers interior loops, in particular stacks (no isolated vertex
between two parallel arcs), and disfavors multi-loops. Based on this observation, we give a simplified energy
model for a loop $\lambda$ contained in a secondary structure by

• $f(\lambda) = -0.5$ if $\lambda$ is a hairpin loop,

• $f(\lambda) = 1$ if $\lambda$ is an interior loop,

• $f(\lambda) = -5$ if $\lambda$ is a multi-loop,

where $\lambda$ is a loop. The weight for a secondary structure $S$ accordingly is given by

\[ f(S) = \sum_{\lambda \in S} f(\lambda). \tag{12} \]
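
The simplified loop model can be evaluated directly on a dot-bracket string. The sketch below is illustrative and makes two assumptions that the text does not settle: a loop is classified by the number of arcs nested directly inside its closing arc (none: hairpin, one: interior, two or more: multi-loop), and the exterior loop carries no weight. The function name `loop_score` is hypothetical.

```python
def loop_score(dot_bracket):
    """Sum of f(lambda) over all loops closed by an arc; the exterior loop is ignored."""
    weights = {"hairpin": -0.5, "interior": 1.0, "multi": -5.0}
    total = 0.0
    open_arcs = []  # per open arc: number of arcs found directly inside it so far
    for ch in dot_bracket:
        if ch == "(":
            open_arcs.append(0)
        elif ch == ")":
            if not open_arcs:
                raise ValueError("unbalanced structure")
            children = open_arcs.pop()
            if children == 0:
                kind = "hairpin"
            elif children == 1:
                kind = "interior"
            else:
                kind = "multi"
            total += weights[kind]
            if open_arcs:
                open_arcs[-1] += 1  # this arc is nested directly inside its parent
        elif ch != ".":
            raise ValueError("unexpected character: " + ch)
    if open_arcs:
        raise ValueError("unbalanced structure")
    return total
```

For example, a stack of two arcs closing one hairpin, `((...))`, consists of one interior loop and one hairpin loop, while `(((...)(...)))` contains one interior loop, one multi-loop and two hairpin loops.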




Figure 11: Diagram representation of loop types: (A) hairpin loop, (B) interior loop, (C) multi-loop.

Let $F^*_0(z)$ and $F_0(z)$ be the GFs obtained by setting $t = e$ and $v = 6/16$ in $F^*(z,t,v)$ and $F(z,t,v)$,
where $e$ is the Euler number. This means we find a suitable parameterization which brings us back to a
simple univariate GF.

Lemma 3. The weight function of RNA secondary structures, $F^*_0(z)$, satisfies



f;(=, = + ±eV + ^.-Vp^' (13, 

16 1 — 2 16 \l — z J 16 1 — Fq(z)yi^1 — z 

and F*(z) is uniquely determined by the above equation. Furthermore 

Proof. We first consider the GF $\mathbf{F}_0^*(z)$ whose coefficient of $z^n$ denotes the total weight of irreducible secondary
structures over $n$ vertices, where $(1,n)$ is an arc. Thus it gives a term $\frac{6}{16}z^2$. Isolated vertices lead to the
term

$z^p \sum_{i=0}^{\infty} z^i = \frac{z^p}{1-z}$,

where $p$ denotes the minimum number of isolated vertices to be inserted. Depending on the types of loops
formed by $(1,n)$, we have

• hairpin loops: j^, 

• interior loops: F^{z) ( 



• multi-loops: there are at least two irreducible substructures, as well as isolated vertices, thus 
We compute 



\ 



1-z \l-zj ' ' i„F5(2)^l-z ' 



which establishes the recursion. The uniqueness of the solution as a power series follows from the fact that 
each coefficient can evidently be recursively computed. 

An arbitrary secondary structure can be considered as a sequence of irreducible substructures with certain
intervals of isolated vertices. Thus

$\mathbf{F}_0(z) = \frac{1}{1-z}\sum_{i=0}^{\infty}\left(\mathbf{F}_0^*(z)\,\frac{1}{1-z}\right)^{i} = \frac{1}{1-z}\cdot\frac{1}{1-\mathbf{F}_0^*(z)\,\frac{1}{1-z}}$.

□

Lemma 4. $\mathbf{F}_0^*(z)$ and $\mathbf{F}_0(z)$ have the same singular expansion.

$f_0^*(n) \sim \alpha\, n^{-3/2}\,\gamma^{n}$ and $f_0(n) \sim \beta\, n^{-3/2}\,\gamma^{n}$, (15)

where $\alpha \approx 0.24$ and $\beta \approx 2.88$ are constants and $\gamma \approx 2.1673$.

Proof. Solving eq. (13) we obtain a unique solution for $\mathbf{F}_0^*(z)$ whose coefficients are all positive. The
dominant singularity of $\mathbf{F}_0^*(z)$ is $\rho \approx 0.4614$. $\mathbf{F}_0(z)$ is a function of $\mathbf{F}_0^*(z)$, and we observe that the real
root of minimal modulus of $1 - \mathbf{F}_0^*(z)\frac{1}{1-z} = 0$ is bigger than $\rho$. Then the supercritical paradigm [41]
applies, and $\mathbf{F}_0(z)$ and $\mathbf{F}_0^*(z)$ have identical exponential growth rates. Furthermore, $\mathbf{F}_0^*(z)$ and $\mathbf{F}_0(z)$ have the
same sub-exponential factor $n^{-3/2}$, hence the lemma. □
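The two ideas used here, building all structures as sequences of irreducibles and comparing the growth of the two coefficient sequences, can be tried out on a small example. The sketch below is a hypothetical, unweighted analogue and not the energy-weighted GF of Lemma 3: it counts secondary structures whose arcs have length at least two, and takes an irreducible structure over $n \ge 3$ vertices to be the arc $(1,n)$ enclosing an arbitrary structure on the $n-2$ inner vertices. The function names and the toy model are our assumptions. In this toy model the fraction of irreducible structures approaches the positive constant $((3-\sqrt{5})/2)^2 \approx 0.146$ instead of decaying, which is the qualitative behaviour the lemma expresses through $\alpha/\beta$.

```python
from math import sqrt


def all_from_irreducible(irr):
    """Coefficients f(n) of F(z) = 1/(1-z) * 1/(1 - F*(z)/(1-z)).

    irr[n] are the coefficients of F*(z), with irr[0] == 0. The identity is
    equivalent to (1 - z - F*(z)) F(z) = 1, solved coefficient by coefficient.
    """
    f = [1]
    for n in range(1, len(irr)):
        f.append(f[n - 1] + sum(irr[k] * f[n - k] for k in range(1, n + 1)))
    return f


def toy_counts(n_max):
    """Toy model: irr(n) = f(n-2) for n >= 3, f built from irr as above."""
    f = [1]
    irr = [0]
    for n in range(1, n_max + 1):
        irr.append(f[n - 2] if n >= 3 else 0)
        f.append(f[n - 1] + sum(irr[k] * f[n - k] for k in range(1, n + 1)))
    return irr, f


if __name__ == "__main__":
    irr, f = toy_counts(300)
    limit = ((3 - sqrt(5)) / 2) ** 2
    for n in (10, 50, 100, 300):
        print(n, irr[n] / f[n], limit)
```

The exact integer arithmetic avoids overflow for the lengths shown; only the final ratio is converted to a float.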

Theorem 4. Suppose an mfe secondary structure over an interval of length $m$ is irreducible with probability
$P_0(m) = f_0^*(m)/f_0(m)$. Then the expected number of candidates for sequences of length $n$ is

$E_0(n) = \Theta(n^2)$

and furthermore, setting Kg (n) ~Eg{n)/n{n), we have

Eo(n) ^f*(n)/fo(n)

where $b = \alpha/\beta \approx 0.08$.

Proof. By Lemma 4 we have $f_0^*(m)/f_0(m) \sim b$, where $b$ is a constant. The proof is completely analogous to
that of Theorem 3. □

We show the distribution of $P_0(m)$ and $E_0(n)$ in Figure 12.

Figure 12: The distribution of $P_0(m)$ (A) and $E_0(n)$ obtained by folding 100 random sequences on the loop-based
model (B) (solid), as well as the theoretical expectation implied by Theorem 4 (B) (dashed).

Results and Discussion

In this paper we quantify the effect of sparsification of the particular decomposition rule A*. This rule splits
an interval and thereby separates concatenated substructures. The sparsification of A* alone is claimed
to provide a speed-up of up to a linear factor for the DP-folding of RNA secondary structures [24]. A
similar conclusion is drawn in [26], where the sparsification of RNA-RNA interaction structures is shown
to also experience a linear reduction in time complexity. Both papers [24, 26] base their conclusions on the
validity of the polymer-zeta property discussed in Section Sparsification.

For the folding of pseudoknot structures there may, however, exist non-sparsifiable rules, in which case
the overall time complexity is not reduced. The key object here is the set of candidates, and we provide an
analysis of A*-candidates by combinatorial means. In general, the connection between candidates, i.e. unions
of disjoint intervals, and the combinatorics of structures is actually established by the algorithm itself via
backtracking: at the end of the DP-algorithm a structure is generated that realizes the previously
computed energy as mfe-structure. This connects intervals and substructures.
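To make the role of the candidate set concrete, here is a minimal sketch of a sparsified folding recursion. It is an illustration under stated assumptions rather than the algorithm analyzed in this paper: base-pair maximization (Nussinov-style) stands in for the mfe energy model, a minimum hairpin size of three unpaired bases is assumed, and all names (`fold_plain`, `fold_sparse`, `MIN_LOOP`) are hypothetical. For a fixed left end $i$, an interval $[i,k]$ is stored as a candidate only if the structure closed by the arc $(i,k)$ is strictly better than every alternative on that interval, and splits are then tried at candidates only.

```python
import random

# Watson-Crick and wobble pairs; pair counting replaces the energy model (assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases enclosed by an arc


def can_pair(seq, i, k):
    return k - i > MIN_LOOP and (seq[i], seq[k]) in PAIRS


def fold_plain(seq):
    """O(n^3) reference. N[i][j] is the maximum number of pairs in seq[i:j]."""
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            best = N[i + 1][j]                       # position i stays unpaired
            for k in range(i + 1, j):                # i pairs with k, rest follows
                if can_pair(seq, i, k):
                    best = max(best, 1 + N[i + 1][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n]


def fold_sparse(seq):
    """Same optimum, with splits restricted to candidates.

    Returns (optimum, total number of candidates).
    """
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    total = 0
    for i in range(n - 1, -1, -1):
        cands = []                                   # pairs (k, score of closed structure)
        for j in range(i + 1, n + 1):
            best = N[i + 1][j]                       # position i stays unpaired
            for k, closed in cands:                  # split only after a candidate
                best = max(best, closed + N[k + 1][j])
            k = j - 1
            if can_pair(seq, i, k):
                closed = 1 + N[i + 1][k]
                if closed > best:                    # strictly better than any alternative
                    cands.append((k, closed))
                    best = closed
            N[i][j] = best
        total += len(cands)
    return N[0][n], total


if __name__ == "__main__":
    random.seed(0)
    s = "".join(random.choice("ACGU") for _ in range(60))
    print(fold_plain(s), fold_sparse(s))
```

The strict inequality is the design choice that matters: a closed structure that merely ties with a split or with leaving position $i$ unpaired is never needed, because the tying alternative is already covered by an earlier candidate or by the unpaired case.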

So, does polymer-zeta apply in the context of RNA structures? In fact polymer-zeta would follow if the 
intervals in question are distributed as in uniformly sampled structures. This however, is far from reasonable, 
due to the fact that the mfe-algorithm deliberately designs some mfe structure over the given interval. What 
the algorithm produces is in fact antagonistic to uniform sampling. We here wish to acknowledge the help 
of one anonymous referee in clarifying this point. 

Our results clearly show that the polymer-zeta property, i.e. that the probability of an irreducible structure
over an interval of length $m$ satisfies a formula of the form

$\mathbb{P}(\text{there exists an irreducible structure over } [1,m]) = b\,m^{-c}$, where $b, c > 0$, (16)

does not apply to RNA structures. The theoretical findings from self-avoiding walks [30] unfortunately do
not allow one to quantify the expected number of candidates of the A*-rule in RNA folding.

That the polymer-zeta property does not hold for RNA has also been observed in the context of the limit
distribution of the 5'-3' distances of RNA secondary structures [31]. Here it is observed that long arcs, to
be precise arcs of length $O(n)$, always exist. This is of course a contradiction to eq. (16).
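Why the decay of the irreducibility probability matters can be seen from a small numerical comparison. The sketch below rests on a simplification of ours, not on a formula taken from the text: a sequence of length $n$ has $n-m+1$ intervals of length $m$, and each of them carries a candidate with probability $P(m)$, so that by linearity of expectation $E(n) = \sum_{m=1}^{n} (n-m+1)\,P(m)$. The values $b = 0.08$ and $c = 1.5$ are illustrative choices, and the function names are hypothetical.

```python
def expected_candidates(n, prob):
    """E(n) = sum over interval lengths m of (number of intervals) * P(m)."""
    return sum((n - m + 1) * prob(m) for m in range(1, n + 1))


def polymer_zeta(b, c):
    """Probability profile P(m) = b * m^(-c), as in the polymer-zeta property."""
    return lambda m: b * m ** (-c)


def constant(b):
    """Probability profile P(m) = b, independent of the interval length."""
    return lambda m: b


if __name__ == "__main__":
    for n in (500, 1000, 2000, 4000):
        decaying = expected_candidates(n, polymer_zeta(0.08, 1.5))
        flat = expected_candidates(n, constant(0.08))
        print(n, round(decaying, 1), round(flat, 1))
```

Doubling $n$ roughly doubles the expectation under the decaying profile but quadruples it under the constant one, which is the difference between a linear and a quadratic candidate set.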

The key to the quantification of the expected number of candidates is the singularity analysis of a pair of
energy-filtered GFs, namely that of a class of structures and that of the subclass of all such structures that
are irreducible. We show that for various energy models the singular expansions of both these functions
are essentially equal, modulo some constant. This implies that the expected number of candidates is $\Theta(n^2)$
and that all constants can be explicitly computed from a detailed singularity analysis. The good news is that,
depending on the energy model, a significant constant reduction of around 95% can be obtained. This is in
accordance with data produced in [25] for the mfe-folding of random sequences. There a reduction by 98%
is reported for sequences of length > 500.

Our findings are of relevance for numerous results, that are formulated in terms of sizes of candidate 
sets [57|. These can now be quantified. It is certainly of interest to devise a full fledged analysis of the 
loop-based energy model. While these computations are far from easy our framework shows how to perform 
such an analysis. 

Using the paradigm of gap-matrices, Backofen has shown [27] that the sparsification of the DP-folding
of RNA pseudoknot structures exhibits additional instances where sparsification can be applied, see Fig. 6
(B). Our results show that the expected number of candidates is $\Theta(n^2)$, where the constant reduction is
around 90%. This is in fact very good news, since the sequence length in the context of RNA pseudoknot
structure folding is on the order of hundreds of nucleotides. So sparsification of further instances does have
a significant impact on the time complexity of the folding.



Proofs 

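In this section, we prove Lemma 2 and Theorem 2.

Proof of Lemma 2. Let $\mathbf{D}(z,u)$ and $\mathbf{D}^*(z,u)$ be the bivariate GFs $\mathbf{D}(z,u) = \sum_{n \ge 0}\sum_{g \ge 0} d_g(n)\, z^n u^g$
and $\mathbf{D}^*(z,u) = \sum_{n \ge 1}\sum_{g \ge 0} d_g^*(n)\, z^n u^g$. Suppose a structure contains exactly $j$ irreducible structures, then

$\mathbf{D}(z,u) = \sum_{j \ge 0} \mathbf{D}^*(z,u)^j = \frac{1}{1 - \mathbf{D}^*(z,u)}$ (17)

and

$\mathbf{D}_g^*(z) = [u^g]\,\mathbf{D}^*(z,u) = -[u^g]\,\frac{1}{\mathbf{D}(z,u)}$, $g \ge 1$, (18)

as well as $\mathbf{D}_0^*(z) = 1 - [u^0]\,\frac{1}{\mathbf{D}(z,u)}$. Let $\mathbf{F}(z,u) = \sum_{n \ge 0}\sum_{g \ge 0} f_g(n)\, z^n u^g = \frac{1}{\mathbf{D}(z,u)}$. Then $\mathbf{F}(z,u)\mathbf{D}(z,u) = 1$,
whence for $g \ge 1$,

$\sum_{g_1=0}^{g} \mathbf{F}_{g_1}(z)\,\mathbf{D}_{g-g_1}(z) = [u^g]\,\mathbf{F}(z,u)\mathbf{D}(z,u) = 0$, (19)

and $\mathbf{F}_0(z)\mathbf{D}_0(z) = 1$, where $\mathbf{F}_g(z) = \sum_{n \ge 0} f_g(n)\, z^n = [u^g]\,\mathbf{F}(z,u) = [u^g]\,\frac{1}{\mathbf{D}(z,u)}$. Furthermore, we have

$\mathbf{F}_0(z) = \frac{1}{\mathbf{D}_0(z)}$ and $\mathbf{F}_g(z) = -\frac{\sum_{g_1=0}^{g-1} \mathbf{F}_{g_1}(z)\,\mathbf{D}_{g-g_1}(z)}{\mathbf{D}_0(z)}$, (20)

which implies $\mathbf{D}_0^*(z) = 1 - \mathbf{F}_0(z) = 1 - \frac{1}{\mathbf{D}_0(z)}$ and

$\mathbf{D}_g^*(z) = -\frac{(\mathbf{D}_0^*(z) - 1)\,\mathbf{D}_g(z) + \sum_{g_1=1}^{g-1} \mathbf{D}_{g_1}^*(z)\,\mathbf{D}_{g-g_1}(z)}{\mathbf{D}_0(z)}$.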
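As a quick sanity check, the recursion can be unfolded for the first two genera. This is a worked instance added for illustration; it assumes the recursion reads $\mathbf{D}_g^*(z) = -\big[(\mathbf{D}_0^*(z)-1)\mathbf{D}_g(z) + \sum_{g_1=1}^{g-1}\mathbf{D}_{g_1}^*(z)\mathbf{D}_{g-g_1}(z)\big]/\mathbf{D}_0(z)$ and uses $\mathbf{D}_0^*(z) - 1 = -1/\mathbf{D}_0(z)$.

```latex
\begin{align*}
\mathbf{D}_1^*(z)
  &= -\frac{(\mathbf{D}_0^*(z)-1)\,\mathbf{D}_1(z)}{\mathbf{D}_0(z)}
   = \frac{\mathbf{D}_1(z)}{\mathbf{D}_0(z)^2},\\[4pt]
\mathbf{D}_2^*(z)
  &= -\frac{(\mathbf{D}_0^*(z)-1)\,\mathbf{D}_2(z) + \mathbf{D}_1^*(z)\,\mathbf{D}_1(z)}{\mathbf{D}_0(z)}
   = \frac{\mathbf{D}_2(z)}{\mathbf{D}_0(z)^2} - \frac{\mathbf{D}_1(z)^2}{\mathbf{D}_0(z)^3}.
\end{align*}
```

Each further genus adds terms indexed by the compositions of $g$, with alternating signs and increasing powers of $\mathbf{D}_0(z)$ in the denominator.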

Proof of Theorem 2. Let $[n]_k$ denote the set of compositions of $n$ having $k$ parts, i.e. for $\sigma \in [n]_k$ we
have $\sigma = (\sigma_1, \ldots, \sigma_k)$ and $\sum_{i=1}^{k} \sigma_i = n$.

We shall prove the Claim by induction on $g$. For $g = 1$ we have

$\mathbf{D}_1^*(z) = \frac{\mathbf{D}_1(z)}{\mathbf{D}_0(z)^2}$, (23)

whence eq. (22) holds for $g = 1$. By induction hypothesis, we may now assume that for $j \le g$, eq. (22) holds.
According to Lemma 2 we have

(DS(z) - l)D3+i(z) + Eg.^i {z)-Da+i~g^ jz) 



Do(z) 

Sl-2 



_ f D,.(z) (-1)^^+^-^- ^ Wd M D,, (■ 

We next observe 



-Ld^D9+i-9i(^)= Do(z)9+W) 22 11 

51=1 ' \<T'e[s+i],+i_(,_i, j=i 

and setting h — gi — j we obtain, 

9 91-2 , -.xgi + l-j / 91 -J \ 

-EE D (^)9i+2-j E n°'^.(^)p^9+l-9l(^) 

9 91 (_-[^'j/i+2 / 



(24) 



EE^^ E nD.(^) D,,,_,,(.) 



91=1 h=2 yo-e[gi]hi=l 



Es^ E E nD.(.) D,,,_,,(.) 

9 / iVi+2 / h+1 



/i=2 ' \<T'e[g+l]h+i i=l 

and setting $j = g - h$

Consequently, the Claim holds for any $g \ge 1$.
For any $g \ge 1$, we have [53]

$\mathbf{D}_g(z) = \frac{1}{z^2 - z + 1}\,\frac{P_g(u)}{(1-4u)^{3g-1/2}}$, $\mathbf{D}_0(z) = \frac{1}{z^2 - z + 1}\,\frac{2}{1+\sqrt{1-4u}}$,

where $P_g(u)$ is a polynomial with integral coefficients of degree at most $(3g-1)$, $P_g(1/4) \neq 0$, $[u^{2g}]P_g(u) \neq 0$
and $[u^h]P_g(u) = 0$ for $0 \le h \le 2g-1$. Let $u = \frac{z^2}{(z^2 - z + 1)^2}$. The Claim provides in this context the following
interpretation of $\mathbf{D}_g^*(z)$
interpretation of T)*{z) 



(25) 



and 



2 ) (i_4u)3f-^ 



^ ^2 ^ ^ _ ^-^ 

As < j < 5 — 2 and 5 — j < s < 2(7 + f — 2j, we have s > 2. Consequently we arrive at 

I D^L-)- I (26) 



where 



and 



s is odd ^ ^ 



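We have for $\sigma \in [g]_k$, $k \ge 1$,

$[u^h]\prod_{i=1}^{k} P_{\sigma_i}(u) = \sum_{h_1+\cdots+h_k=h}\ \prod_{i=1}^{k} [u^{h_i}]P_{\sigma_i}(u)$,

where $\sum_{i=1}^{k} h_i = h$, $h_i \ge 0$. Then we obtain that

$[u^h]\left(\sum_{\sigma \in [g]_k}\prod_{i=1}^{k} P_{\sigma_i}(u)\right) = 0$, $0 \le h \le 2g-1$, (27)

since $[u^{h_i}]P_{\sigma_i}(u) = 0$ for $h_i \le 2\sigma_i - 1$, $[u^{2\sigma_i}]P_{\sigma_i}(u) \neq 0$ and $\sum_{i=1}^{k} \sigma_i = g$. Thus for $0 \le h \le 2g-1$,

$[u^h]U_g(u) = 0$ and $[u^h]V_g(u) = 0$. (28)

As shown in [23] we have

$P_g(1/4) = \frac{\Gamma(g-1/6)\,\Gamma(g+1/2)\,\Gamma(g+1/6)}{6\pi^{3/2}\,\Gamma(g+1)}\left(\frac{9}{4}\right)^{g}$ (29)

and we obtain $U_g(1/4) = P_g(1/4)/4$. Furthermore,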

V,(l/4) = ^4^+f-iy( ^ nP..(l/4)l =i |4P,(l/4)-£p,(l/4)P,_,(l/4)l ^0. 



i^(tG[3]2*=1 / \ J = l 



We can recruit the computation of $P_g(1/4)$ in order to observe $4P_g(1/4) - \sum_{j=1}^{g-1} P_j(1/4)P_{g-j}(1/4) \neq 0$. In
order to compute the bivariate GF, $\mathbf{E}_g^*(z,t)$, we only need to replace $\mathbf{D}_g(z)$ in eq. (22) by $\mathbf{E}_g(z,t)$, and the
proof is completely analogous.
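The non-vanishing statement can be checked numerically for small genus. The sketch below assumes that the Gamma-function expression cited from [23] reads $P_g(1/4) = \Gamma(g-1/6)\Gamma(g+1/2)\Gamma(g+1/6)\,(9/4)^g / (6\pi^{3/2}\Gamma(g+1))$. As an independent check it compares against the low-genus polynomials $P_1(u) = u^2$ and $P_2(u) = 21u^4(u+1)$, which are not stated in this text and are taken here from the standard Harer-Zagier generating functions. The function names are hypothetical.

```python
from math import gamma, pi


def p_quarter(g):
    """Value P_g(1/4) according to the Gamma-function expression (assumed form)."""
    return (gamma(g - 1 / 6) * gamma(g + 1 / 2) * gamma(g + 1 / 6) * (9 / 4) ** g
            / (6 * pi ** 1.5 * gamma(g + 1)))


def nonvanishing_term(g):
    """4 P_g(1/4) - sum_{j=1}^{g-1} P_j(1/4) P_{g-j}(1/4)."""
    return 4 * p_quarter(g) - sum(p_quarter(j) * p_quarter(g - j) for j in range(1, g))


if __name__ == "__main__":
    print(p_quarter(1), 1 / 16)          # P_1(1/4) with P_1(u) = u^2
    print(p_quarter(2), 105 / 1024)      # P_2(1/4) with P_2(u) = 21 u^4 (u + 1)
    for g in range(1, 11):
        print(g, nonvanishing_term(g))
```

Because $P_g(1/4)$ grows very quickly with $g$, the term $4P_g(1/4)$ dominates the convolution sum in the range tested, so the difference stays positive there; this is a numerical observation for small $g$, not a proof.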
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Figures 

Figure 1 - RNA structures as planar graphs and diagrams 

(A) an RNA secondary structure and (B) an RNA pseudoknot structure. 

Figure 2 - Sparsification of secondary structure folding 

Suppose the optimal solution $L_{i,j}$ is obtained from the optimal solutions $L_{i,k}$, $L_{k+1,q}$ and $L_{q+1,j}$. Based on
the recursions of the secondary structures, $L_{i,k}$ and $L_{k+1,q}$ produce an optimal solution of $L_{i,q}$. Similarly,
$L_{k+1,q}$ and $L_{q+1,j}$ produce an optimal solution of $L_{k+1,j}$. Now, in order to obtain an optimal solution of
$L_{i,j}$ it is sufficient to consider either the grouping $L_{i,q}$ and $L_{q+1,j}$ or $L_{i,k}$ and $L_{k+1,j}$.

Figure 3 - What sparsification can and cannot prune

What sparsification can and cannot prune: (A) and (B) are two computation paths yielding the same
optimal solution. Sparsification reduces the computation to path (A) where $S_{i,k_1}$ is irreducible. (C) is
another computation path with a distinct leftmost irreducible structure over a different interval, hence
representing a new candidate that cannot be reduced to (A) by the sparsification.

Figure 4 - Sparsification 

Sparsification: Ly is alternatively realized via L„j and Ly'_^, or L^j and Ly^. Thus it is sufficient to only 
consider one of the computation paths. 

Figure 5 - The recursion solving the optimal solution for secondary structures 

The recursion solving the optimal solution for secondary structures. 

Figure 6 - Decomposition rules for pseudoknot structures of fixed genus 

(A) three decompositions via the rule A*, which is s-compatible to itself. We show that for A* we obtain a
linear reduction in time complexity. (B) three decomposition rules $A_1$, $A_2$, $A_3$ where $A_2$, $A_3$ are s-compatible
to $A_1$. A quantification of the candidate set is not implied by the polymer-zeta property. (C) three
decomposition rules $A_1$, $A_2$, $A_3$ where $A_2$, $A_3$ are not s-compatible to $A_1$.

Figure 7 - RNA structures and diagram representation 

A diagram over {1, . . . , 40}. The arcs (1, 21) and (11, 33) are crossing and the dashed arc (9, 10) is a 1-arc 
which is not allowed. This structure contains 3 stacks with length 7, 4 and 6, from left to right respectively. 

Figure 8 - Irreducibility relative to a decomposition rule 

For the rule A* splitting $S_{i,j}$ into $S_{i,k}$ and $S_{k+1,j}$, $S_{1,40}$ is not A*-irreducible, while $S_{1,25}$ and $S_{28,40}$ are. However,
for the decomposition rule $A_2$, which removes the outmost arc, $S_{28,40}$ is not $A_2$-irreducible while $S_{1,25}$ is.

Figure 9 - The expected number of candidates for secondary and 1-structures, $E_0(n)$ and $E_1(n)$

We compute the expected number of candidates obtained by folding 100 random sequences for secondary
structures (A) (solid) and 1-structures (B) (solid). We also display the theoretical expectations implied by
Theorem 3 (A) (dashed) and (B) (dashed).

Figure 10 - The probability distribution of $P_0(m)$ and $P_1(m)$

The probability distribution of $P_0(m)$ (A) and $P_1(m)$ (B).

Figure 11 - Diagram representation of loop types

(A) hairpin loop, (B) interior loop, (C) multi-loop.

Figure 12 - The distribution of $P_0(m)$ (A) and $E_0(n)$

The distribution of $P_0(m)$ (A) and $E_0(n)$ obtained by folding 100 random sequences on the loop-based model
(B) (solid), as well as the theoretical expectation implied by Theorem 4 (B) (dashed).